R Terminology

Before we start, some terms/concepts are important to understand.

  1. Data types - Under the hood, R stores information in a few different ways:

    • Integer - These are whole numbers
    • Numeric - Any number (can be a whole number)
    • Character (string) - Letters/words; shown as <chr> in tibble output
    • Factor - Letters/words that are stored as numbers (we will touch on this later)
    • Time - Dates/times
  2. Vectors - Vectors in R are akin to lists in other languages, except that every element of a vector has the same data type. All data is stored in a vector and can be accessed by index/position (counting from 1). Most vectors are created using c("thing1","thing2",…,"thingn").

  3. Data frames/tibbles - Dataframes and tibbles are conceptually equivalent to an Excel sheet. They store data in rows and columns.
  4. The pipe ( %>% ) - The pipe is a special operator that allows us to write cleaner code by saying “take the output of the expression on the left and pass it as the first argument to the function on the right.”
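
As a minimal sketch of the terms above, the following shows how R reports each data type, how a vector is indexed, and how the pipe passes a value along. The variable names are made up for illustration, and the pipe is loaded here from the magrittr package (loading the tidyverse makes it available as well).

```r
library(magrittr)  # provides the pipe; library(tidyverse) also makes it available

# class() reports how R is storing a value
whole   <- 3L                              # the L suffix asks for an integer
decimal <- 3.5
word    <- "kiwi"
fruit   <- factor(c("kiwi", "plum", "kiwi"))
day     <- as.Date("2020-02-14")

class(whole)       # "integer"
class(decimal)     # "numeric"
class(word)        # "character"
class(fruit)       # "factor"
as.integer(fruit)  # 1 2 1, the numbers stored behind the labels
class(day)         # "Date"

# Vectors are built with c() and indexed by position, starting at 1
things <- c("thing1", "thing2", "thing3")
things[2]          # "thing2"

# The pipe: these two lines compute the same value (6)
piped  <- c(1, 4, 9) %>% sqrt() %>% sum()
nested <- sum(sqrt(c(1, 4, 9)))
```

Reading the piped version left to right ("take these numbers, take square roots, then sum") is usually easier than reading the nested version inside out.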

Packages

Installing packages

Packages provide a lot of the functionality that makes R so useful. I’ve already installed all of the packages necessary to run this tutorial, but if you need to install them on your computer, run the line below without the leading #, replacing PACKAGE_NAME with the name of the package:

#install.packages("PACKAGE_NAME",dependencies = TRUE)

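If asked to compile packages, try the “Yes” option and, if it fails, redo with the “No” option. Packages from Bioconductor are not installed this way; they are typically installed with BiocManager::install("PACKAGE_NAME") after installing the BiocManager package from CRAN.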
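
To avoid reinstalling packages that are already present, one option is to check first. The sketch below is only an illustration: missing_packages() is a hypothetical helper written for this example, not part of any package, and the install lines are left commented out so that nothing is installed by accident.

```r
# Hypothetical helper (not part of any package): returns the names in
# `pkgs` that cannot be found in the current library.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("tidyverse", "ggrepel"))
to_install

# Uncomment to install whatever is missing from CRAN:
# if (length(to_install) > 0) install.packages(to_install, dependencies = TRUE)

# Bioconductor packages are installed through the BiocManager package:
# install.packages("BiocManager")
# BiocManager::install("DESeq2")
```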

Loading Packages

First we load the core tidyverse package. This package includes:

  • readr - Reads in text files (.csv, .tsv)
  • tibble - Makes dataframes (data structure with rows and columns)
  • tidyr - Functions to clean up data
  • dplyr - For manipulating/processing data.
  • stringr - Functions to work with text
  • forcats - Functions to work with factors (more details to follow)
  • purrr - (Not covered) Helps replace for loops with easier to interpret code
  • ggplot2 - Make graphs/figures. We won’t cover this in detail but will use ggplot output.

We also load the ggrepel package that helps format ggplot output.

library(tidyverse) 
## ── Attaching packages ─────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.4
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggrepel)
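
The "Conflicts" lines in the output above mean that two loaded packages define a function with the same name, and the most recently loaded one wins. The code in this tutorial often writes package::function() to make explicit which version is meant. A small sketch of that notation (the tibble `small` is made up for illustration):

```r
library(dplyr)

# After library(dplyr), a bare filter() refers to dplyr's version
environmentName(environment(filter))          # "dplyr"

# The masked function is still reachable through its package prefix
environmentName(environment(stats::filter))   # "stats"

# pkg::fun() also works without attaching pkg via library() at all
small <- tibble::tibble(x = c(1, 5, 10))
kept  <- dplyr::filter(small, x > 2)
kept$x                                         # 5 10
```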

Data In and Out

There are a lot of ways to get data into R. We will go over the readr and tibble packages.

Tibble

We can manually create a tibble using a series of vectors. You simply specify the name of each column and the data corresponding to it.

OCH_tibble<-tibble::tibble(
  names=c("Claire","Cory","Rachael"),
  r_abil=c(10,5,8),
  height=c(165.1,187.96,158.75)
  )

print(OCH_tibble)
## # A tibble: 3 x 3
##   names   r_abil height
##   <chr>    <dbl>  <dbl>
## 1 Claire      10   165.
## 2 Cory         5   188.
## 3 Rachael      8   159.

Readr

We can also load a data set using read_csv. This creates a dataframe/tibble object. Think of this as one sheet from an Excel file where data is stored in rows and columns. Readr also has other functions for reading different file types (e.g. read_tsv). We can also write our data to an output file using the various write_ functions.

Gap_minder<-readr::read_csv("GM_code_along.csv")
## Parsed with column specification:
## cols(
##   country = col_character(),
##   continent = col_character(),
##   year = col_double(),
##   lifeExp = col_double(),
##   pop = col_double(),
##   gdpPercap = col_double()
## )
print(Gap_minder)
## # A tibble: 1,705 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # … with 1,695 more rows
readr::write_csv(OCH_tibble,"OCH.csv")
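
If you do not have GM_code_along.csv at hand, the write/read round trip can be tried on a temporary file. This is a sketch under stated assumptions: example_tbl and tmp_file are names invented for the example, and the col_types argument is optional (it is shown because it skips type guessing and silences the "Parsed with column specification" message).

```r
library(tibble)
library(readr)

# A small stand-in tibble (illustrative values only)
example_tbl <- tibble(
  names  = c("Claire", "Cory"),
  r_abil = c(10, 5)
)

# Write to a temporary file so nothing in the working directory is touched
tmp_file <- tempfile(fileext = ".csv")
write_csv(example_tbl, tmp_file)

# Declaring col_types avoids type guessing and silences the parsing message
round_trip <- read_csv(
  tmp_file,
  col_types = cols(
    names  = col_character(),
    r_abil = col_double()
  )
)

round_trip
```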

Loading files with RStudio

Lastly, you can load files with RStudio. To do this, simply navigate to the “Files” tab in the bottom right corner, find your file, click on it and select “Import Dataset.” To save, use one of the write_ functions mentioned above.

Dplyr

Next we will use functions from the dplyr package to manipulate our dataset.

Select

Select can be used to pick columns to include or exclude.

# Only include country, continent, year, and lifeExp columns
Gap_minder%>%
  dplyr::select(c(country, continent, year, lifeExp))%>%
  colnames()
## [1] "country"   "continent" "year"      "lifeExp"
#Exclude gdpPercap column
Gap_minder%>%
  dplyr::select(-gdpPercap)%>%
  colnames()
## [1] "country"   "continent" "year"      "lifeExp"   "pop"
#Include columns that contain "co" (i.e. country and continent)
Gap_minder%>%
  dplyr::select(dplyr::contains(c("co")))%>%
  colnames()
## [1] "country"   "continent"
#Reorder columns so year is first
Gap_minder%>%
  dplyr::select(year,dplyr::everything())%>%
  colnames()
## [1] "year"      "country"   "continent" "lifeExp"   "pop"       "gdpPercap"
#Select can be used to rename columns while selecting them,
#or rename can be used to rename columns in place
Gap_minder%>%
  select(year, selected=country)%>%
  colnames()
## [1] "year"     "selected"
Gap_minder%>%
  rename(renamed=country)%>%
  colnames()
## [1] "renamed"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"
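
A few more selection patterns, sketched on a two-row stand-in tibble so the block runs without the CSV file. The name gm_small is made up for the example, and its values are the rounded ones printed in the Gap_minder output above.

```r
library(dplyr)
library(tibble)

# Stand-in rows copied (rounded) from the Gap_minder output above
gm_small <- tibble(
  country   = c("Afghanistan", "Afghanistan"),
  continent = c("Asia", "Asia"),
  year      = c(1952, 1957),
  lifeExp   = c(28.8, 30.3),
  pop       = c(8425333, 9240934),
  gdpPercap = c(779, 821)
)

# A range of adjacent columns, from year through pop
range_cols <- gm_small %>% select(year:pop) %>% colnames()
range_cols    # "year" "lifeExp" "pop"

# Columns whose names start with a prefix
prefix_cols <- gm_small %>% select(starts_with("co")) %>% colnames()
prefix_cols   # "country" "continent"

# Drop several columns at once
dropped_cols <- gm_small %>% select(-c(pop, gdpPercap)) %>% colnames()
dropped_cols  # "country" "continent" "year" "lifeExp"
```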

Filter

Filter is used to include or exclude rows based on a logical condition. Say we just want to compare GDP data for European countries within a certain year:

Europe_1992<-Gap_minder%>%
  dplyr::filter(dplyr::between(year, 1990,1994) & continent=="Europe")

ggplot2::ggplot(Europe_1992,aes(x=lifeExp,y=gdpPercap,fill=continent))+
  ggplot2::geom_point(colour="blue")+
  ggtitle("European Per Capita GDP 1992")+
  ggrepel::geom_text_repel(aes(x=lifeExp,y=gdpPercap,label=country))+
  ggplot2::xlim(c(65,80))+
  ggplot2::ylim(c(2000,35000))

You may also want to filter to compare two countries:

Alb_France<-Gap_minder%>%
  filter( (country=="Albania" | country=="France") & year==1992)

ggplot(data=Alb_France,aes(x=lifeExp,y=gdpPercap,fill=continent))+
  geom_point(colour="blue")+
  ggtitle("France & Albania Per Capita GDP 1992")+
  geom_text_repel(aes(x=lifeExp,y=gdpPercap,label=country))+
  xlim(c(65,80))+
  ylim(c(2000,35000))

Or identify and remove NA values

Gap_minder%>%
  filter(is.na(gdpPercap))
## # A tibble: 1 x 6
##   country  continent  year lifeExp   pop gdpPercap
##   <chr>    <chr>     <dbl>   <dbl> <dbl>     <dbl>
## 1 Coryland <NA>       2019      NA     1        NA
Gap_minder<- Gap_minder%>%
  filter(!is.na(gdpPercap))

#You could also use filter(country!="Coryland")
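
Two details of filter are easy to miss, so here is a small sketch of them. The tibble gm_small is a stand-in invented for this example, using rounded values that appear in the outputs of this tutorial plus the Coryland row.

```r
library(dplyr)
library(tibble)

gm_small <- tibble(
  country   = c("Afghanistan", "Albania", "Algeria", "Afghanistan", "Coryland"),
  continent = c("Asia", "Europe", "Africa", "Asia", NA),
  year      = c(1992, 1992, 1992, 1997, 2019),
  gdpPercap = c(649, 2497, 5023, 635, NA)
)

# %in% replaces a chain of country=="..." | country=="..." comparisons,
# and conditions separated by commas are combined with AND
two_countries <- gm_small %>%
  filter(country %in% c("Albania", "Algeria"), year == 1992)
two_countries$country   # "Albania" "Algeria"

# A comparison against NA yields NA, and filter() drops those rows too,
# so Coryland disappears here without an explicit is.na() check
rich <- gm_small %>%
  filter(gdpPercap > 1000)
rich$country            # "Albania" "Algeria"
```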

Mutate

Mutate is used to add new columns to a tibble, usually based on calculations involving an existing column.

Gap_minder%>%
  mutate(gdptotal=gdpPercap*pop/(10^9))
## # A tibble: 1,704 x 7
##    country     continent  year lifeExp      pop gdpPercap gdptotal
##    <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>    <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.     6.57
##  2 Afghanistan Asia       1957    30.3  9240934      821.     7.59
##  3 Afghanistan Asia       1962    32.0 10267083      853.     8.76
##  4 Afghanistan Asia       1967    34.0 11537966      836.     9.65
##  5 Afghanistan Asia       1972    36.1 13079460      740.     9.68
##  6 Afghanistan Asia       1977    38.4 14880372      786.    11.7 
##  7 Afghanistan Asia       1982    39.9 12881816      978.    12.6 
##  8 Afghanistan Asia       1987    40.8 13867957      852.    11.8 
##  9 Afghanistan Asia       1992    41.7 16317921      649.    10.6 
## 10 Afghanistan Asia       1997    41.8 22227415      635.    14.1 
## # … with 1,694 more rows
Gap_minder%>%
  mutate(era=dplyr::if_else(year<2000,"20th","21st"))
## # A tibble: 1,704 x 7
##    country     continent  year lifeExp      pop gdpPercap era  
##    <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl> <chr>
##  1 Afghanistan Asia       1952    28.8  8425333      779. 20th 
##  2 Afghanistan Asia       1957    30.3  9240934      821. 20th 
##  3 Afghanistan Asia       1962    32.0 10267083      853. 20th 
##  4 Afghanistan Asia       1967    34.0 11537966      836. 20th 
##  5 Afghanistan Asia       1972    36.1 13079460      740. 20th 
##  6 Afghanistan Asia       1977    38.4 14880372      786. 20th 
##  7 Afghanistan Asia       1982    39.9 12881816      978. 20th 
##  8 Afghanistan Asia       1987    40.8 13867957      852. 20th 
##  9 Afghanistan Asia       1992    41.7 16317921      649. 20th 
## 10 Afghanistan Asia       1997    41.8 22227415      635. 20th 
## # … with 1,694 more rows
Decades<-Gap_minder%>%
  mutate(era=dplyr::case_when(year<1960 ~ "50s",
                              year<1970 ~ "60s",
                              year<1980 ~ "70s",
                              year<1990 ~ "80s",
                              year<2000 ~ "90s",
                              year<2010 ~ "00s"))

ggplot(Decades,aes(x=era,y=lifeExp))+
  ggtitle("Life Expectancy by Decade")+
  geom_boxplot()+
  ggplot2::geom_jitter(aes(colour=continent),alpha=0.3,
              show.legend = FALSE)

# See the mutate() page in the dplyr reference manual for more useful helpers, such as
# lead() and lag(), which grab the next or previous value in a column, and the cumulative
# family of functions (e.g. cumsum())
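
A brief sketch of lag, lead, and a cumulative function inside mutate, using the first three Afghanistan population values from the output above. The tibble name afg and the new column names are made up for illustration.

```r
library(dplyr)
library(tibble)

afg <- tibble(
  year = c(1952, 1957, 1962),
  pop  = c(8425333, 9240934, 10267083)
)

afg_changes <- afg %>%
  arrange(year) %>%
  mutate(
    prev_pop     = dplyr::lag(pop),    # value from the previous row (NA in row 1)
    next_pop     = dplyr::lead(pop),   # value from the next row (NA in the last row)
    pop_change   = pop - prev_pop,     # NA, 815601, 1026149
    # coalesce() swaps the leading NA for 0 so the running total is defined
    total_change = cumsum(coalesce(pop_change, 0))
  )

afg_changes
```

The prefix in dplyr::lag() matters if dplyr has not been attached, because base R ships a different stats::lag().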

Grouping and summarizing

Often, you want a number summarizing particular groupings of data, e.g. what was the population increase in each country over the period observed? You can get this with group_by, summarize, and mutate.

#Arrange sorts in ascending order using the variable(s) listed. Use desc(variable) to do the opposite.
#Group_by creates groups based on the levels in the column(s) you specify and then you can use summarize or
#mutate to manipulate data based on those groups. Mutate keeps the previous data structure, 
#summarize makes a new, smaller dataset with one calculated value for each group.

#Summarize creates a smaller, summary tibble.
population_increase<-Gap_minder%>%
  dplyr::arrange(year)%>%
  dplyr::group_by(country)%>%
  dplyr::summarize(
    pop_inc=dplyr::last(pop)-dplyr::first(pop),
    continent=unique(continent))

population_increase
## # A tibble: 142 x 3
##    country       pop_inc continent
##    <chr>           <dbl> <chr>    
##  1 Afghanistan  23464590 Asia     
##  2 Albania       2317826 Europe   
##  3 Algeria      24053691 Africa   
##  4 Angola        8188381 Africa   
##  5 Argentina    22424971 Americas 
##  6 Australia    11742964 Oceania  
##  7 Austria       1272011 Europe   
##  8 Bahrain        588126 Asia     
##  9 Bangladesh  103561480 Asia     
## 10 Belgium       1661821 Europe   
## # … with 132 more rows
ggplot(population_increase,aes(x=continent,y=pop_inc,fill=continent))+
  ggtitle("Population Change 1952 to 2007")+
  geom_violin(colour="black")

#If you use mutate with group_by you add the values to
#the existing tibble.
Gap_minder%>%
  dplyr::arrange(year)%>%
  dplyr::group_by(country)%>%
  mutate(pop_inc=dplyr::last(pop)-dplyr::first(pop))
## # A tibble: 1,704 x 7
## # Groups:   country [142]
##    country     continent  year lifeExp      pop gdpPercap   pop_inc
##    <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.  23464590
##  2 Albania     Europe     1952    55.2  1282697     1601.   2317826
##  3 Algeria     Africa     1952    43.1  9279525     2449.  24053691
##  4 Angola      Africa     1952    30.0  4232095     3521.   8188381
##  5 Argentina   Americas   1952    62.5 17876956     5911.  22424971
##  6 Australia   Oceania    1952    69.1  8691212    10040.  11742964
##  7 Austria     Europe     1952    66.8  6927772     6137.   1272011
##  8 Bahrain     Asia       1952    50.9   120447     9867.    588126
##  9 Bangladesh  Asia       1952    37.5 46886859      684. 103561480
## 10 Belgium     Europe     1952    68    8730405     8343.   1661821
## # … with 1,694 more rows
#You can group_by multiple variables
median_pop<-Gap_minder%>%
  dplyr::group_by(year, continent)%>%
  dplyr::summarize(med_pop=median(pop))

ggplot(median_pop,aes(x=year,y=med_pop,fill=continent,colour=continent))+
  ggtitle("Median Population by Year & Continent")+
  geom_point()+
  stat_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
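
The same pattern on a small scale, as a sketch: several summaries computed in one summarize call, and a grouped mutate followed by ungroup. The tibble gm_1952 is a stand-in built from eight of the 1952 rows printed above; the new column names are invented for the example.

```r
library(dplyr)
library(tibble)

# Rows for 1952 copied from the output above
gm_1952 <- tibble(
  country   = c("Afghanistan", "Bahrain", "Bangladesh", "Albania",
                "Austria", "Belgium", "Algeria", "Angola"),
  continent = c("Asia", "Asia", "Asia", "Europe",
                "Europe", "Europe", "Africa", "Africa"),
  pop       = c(8425333, 120447, 46886859, 1282697,
                6927772, 8730405, 9279525, 4232095)
)

# Several summaries at once; n() counts the rows in each group
by_continent <- gm_1952 %>%
  group_by(continent) %>%
  summarize(
    n_countries = n(),
    total_pop   = sum(pop)
  )
by_continent

# A grouped mutate() keeps every row; ungroup() afterwards so later
# steps are not silently computed per group
shares <- gm_1952 %>%
  group_by(continent) %>%
  mutate(pop_share = pop / sum(pop)) %>%
  ungroup()
```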

Join

You can use joins to merge two tibbles along shared column(s).

OCH_tibble
## # A tibble: 3 x 3
##   names   r_abil height
##   <chr>    <dbl>  <dbl>
## 1 Claire      10   165.
## 2 Cory         5   188.
## 3 Rachael      8   159.
food_tibble<-tibble::tibble(
  names=c("Claire","Rachael","Claus","Eduardo","Dariya"),
  r_abil=c(10,8,11,7,7),
  fave_fruit=c("grapes","apples","kiwi","plum","tomato")
  )

#Keep all data from left tibble
OCH_tibble%>%
  dplyr::left_join(food_tibble)
## Joining, by = c("names", "r_abil")
## # A tibble: 3 x 4
##   names   r_abil height fave_fruit
##   <chr>    <dbl>  <dbl> <chr>     
## 1 Claire      10   165. grapes    
## 2 Cory         5   188. <NA>      
## 3 Rachael      8   159. apples
knitr::include_graphics("animated-left-join.gif")

#Keep all data from right tibble
right_join(OCH_tibble,food_tibble)
## Joining, by = c("names", "r_abil")
## # A tibble: 5 x 4
##   names   r_abil height fave_fruit
##   <chr>    <dbl>  <dbl> <chr>     
## 1 Claire      10   165. grapes    
## 2 Rachael      8   159. apples    
## 3 Claus       11    NA  kiwi      
## 4 Eduardo      7    NA  plum      
## 5 Dariya       7    NA  tomato
knitr::include_graphics("animated-right-join.gif")

#Keep all data
full_join(OCH_tibble,food_tibble)
## Joining, by = c("names", "r_abil")
## # A tibble: 6 x 4
##   names   r_abil height fave_fruit
##   <chr>    <dbl>  <dbl> <chr>     
## 1 Claire      10   165. grapes    
## 2 Cory         5   188. <NA>      
## 3 Rachael      8   159. apples    
## 4 Claus       11    NA  kiwi      
## 5 Eduardo      7    NA  plum      
## 6 Dariya       7    NA  tomato
knitr::include_graphics("animated-full-join.gif")

#Keep only rows that have a match in both tibbles
inner_join(OCH_tibble,food_tibble)
## Joining, by = c("names", "r_abil")
## # A tibble: 2 x 4
##   names   r_abil height fave_fruit
##   <chr>    <dbl>  <dbl> <chr>     
## 1 Claire      10   165. grapes    
## 2 Rachael      8   159. apples
knitr::include_graphics("animated-inner-join.gif")

#Filter out values from left tibble that aren't in right tibble
semi_join(OCH_tibble,food_tibble)
## Joining, by = c("names", "r_abil")
## # A tibble: 2 x 3
##   names   r_abil height
##   <chr>    <dbl>  <dbl>
## 1 Claire      10   165.
## 2 Rachael      8   159.
knitr::include_graphics("animated-semi-join.gif")

#Keep rows from the left tibble that have no match in the right tibble
anti_join(OCH_tibble,food_tibble)
## Joining, by = c("names", "r_abil")
## # A tibble: 1 x 3
##   names r_abil height
##   <chr>  <dbl>  <dbl>
## 1 Cory       5   188.
knitr::include_graphics("animated-anti-join.gif")
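
All of the joins above rely on dplyr guessing the shared columns, which is what the "Joining, by = ..." message reports. The sketch below spells out the key with the by argument instead. It rebuilds the two tibbles from this section so it runs on its own; the result names are made up for illustration.

```r
library(dplyr)
library(tibble)

OCH_tibble <- tibble(
  names  = c("Claire", "Cory", "Rachael"),
  r_abil = c(10, 5, 8),
  height = c(165.1, 187.96, 158.75)
)
food_tibble <- tibble(
  names      = c("Claire", "Rachael", "Claus", "Eduardo", "Dariya"),
  r_abil     = c(10, 8, 11, 7, 7),
  fave_fruit = c("grapes", "apples", "kiwi", "plum", "tomato")
)

# Join on names only: r_abil exists in both tibbles but is no longer a key,
# so both copies are kept and told apart with the suffixes .x and .y
by_name <- left_join(OCH_tibble, food_tibble, by = "names")
colnames(by_name)
# "names" "r_abil.x" "height" "r_abil.y" "fave_fruit"

# The suffix argument gives the duplicated columns clearer names
by_name_labelled <- left_join(OCH_tibble, food_tibble,
                              by = "names", suffix = c("_och", "_food"))
colnames(by_name_labelled)
# "names" "r_abil_och" "height" "r_abil_food" "fave_fruit"
```

Stating by explicitly also removes the "Joining, by" message and protects against accidentally joining on a column that merely happens to share a name.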

Other Packages

Tidyr Pivot

Pivot_wider and pivot_longer (formerly spread and gather) are helpful for formatting data so it can be processed or visualized more easily.

#Pivot_wider and pivot_longer replaced spread 
#and gather in the newest version of tidyr.
#Code using the old functions is commented
#out below alongside the new functions

#pivot_wider (formerly spread) makes data wider
Gap_minder%>%
  select(country, pop, year)%>%
  tidyr::pivot_wider(names_from = year,
                     values_from = pop)
## # A tibble: 142 x 13
##    country `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992` `1997`
##    <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1 Afghan… 8.43e6 9.24e6 1.03e7 1.15e7 1.31e7 1.49e7 1.29e7 1.39e7 1.63e7 2.22e7
##  2 Albania 1.28e6 1.48e6 1.73e6 1.98e6 2.26e6 2.51e6 2.78e6 3.08e6 3.33e6 3.43e6
##  3 Algeria 9.28e6 1.03e7 1.10e7 1.28e7 1.48e7 1.72e7 2.00e7 2.33e7 2.63e7 2.91e7
##  4 Angola  4.23e6 4.56e6 4.83e6 5.25e6 5.89e6 6.16e6 7.02e6 7.87e6 8.74e6 9.88e6
##  5 Argent… 1.79e7 1.96e7 2.13e7 2.29e7 2.48e7 2.70e7 2.93e7 3.16e7 3.40e7 3.62e7
##  6 Austra… 8.69e6 9.71e6 1.08e7 1.19e7 1.32e7 1.41e7 1.52e7 1.63e7 1.75e7 1.86e7
##  7 Austria 6.93e6 6.97e6 7.13e6 7.38e6 7.54e6 7.57e6 7.57e6 7.58e6 7.91e6 8.07e6
##  8 Bahrain 1.20e5 1.39e5 1.72e5 2.02e5 2.31e5 2.97e5 3.78e5 4.55e5 5.29e5 5.99e5
##  9 Bangla… 4.69e7 5.14e7 5.68e7 6.28e7 7.08e7 8.04e7 9.31e7 1.04e8 1.14e8 1.23e8
## 10 Belgium 8.73e6 8.99e6 9.22e6 9.56e6 9.71e6 9.82e6 9.86e6 9.87e6 1.00e7 1.02e7
## # … with 132 more rows, and 2 more variables: `2002` <dbl>, `2007` <dbl>
  #tidyr::spread(key=year,value=pop)
  
#pivot_longer (formerly gather) makes data longer
Gap_minder%>%
  select(country, pop, year)%>%
  tidyr::pivot_wider(names_from = year,
                     values_from = pop)%>%
  tidyr::pivot_longer(-country,
                      names_to="year",
                      values_to="pop"
                      )
## # A tibble: 1,704 x 3
##    country     year       pop
##    <chr>       <chr>    <dbl>
##  1 Afghanistan 1952   8425333
##  2 Afghanistan 1957   9240934
##  3 Afghanistan 1962  10267083
##  4 Afghanistan 1967  11537966
##  5 Afghanistan 1972  13079460
##  6 Afghanistan 1977  14880372
##  7 Afghanistan 1982  12881816
##  8 Afghanistan 1987  13867957
##  9 Afghanistan 1992  16317921
## 10 Afghanistan 1997  22227415
## # … with 1,694 more rows
  # tidyr::spread(key=year,value=pop)%>%
  # tidyr::gather(key=year,value=pop,-country)

Tidyr Separate_Unite

Unite combines text from several columns into one column, and separate splits one column back into several.

food_tibble%>%
  tidyr::unite(name_foods,names,fave_fruit)
## # A tibble: 5 x 2
##   name_foods     r_abil
##   <chr>           <dbl>
## 1 Claire_grapes      10
## 2 Rachael_apples      8
## 3 Claus_kiwi         11
## 4 Eduardo_plum        7
## 5 Dariya_tomato       7
food_tibble%>%
  tidyr::unite(name_foods,names,fave_fruit)%>%
  tidyr::separate(name_foods,c("names","fave_fruit"))
## # A tibble: 5 x 3
##   names   fave_fruit r_abil
##   <chr>   <chr>       <dbl>
## 1 Claire  grapes         10
## 2 Rachael apples          8
## 3 Claus   kiwi           11
## 4 Eduardo plum            7
## 5 Dariya  tomato          7
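
Both functions take a sep argument, and it is worth setting when the text itself contains punctuation. A sketch with a hypothetical hyphenated name (the tibble people and its values are invented for the example):

```r
library(dplyr)
library(tidyr)
library(tibble)

people <- tibble(
  names      = c("Jean-Luc", "Claire"),   # hypothetical names for illustration
  fave_fruit = c("pear", "grapes")
)

# unite() glues the columns together with sep ("_" is also the default)
united <- people %>%
  unite(name_foods, names, fave_fruit, sep = "_")
united$name_foods   # "Jean-Luc_pear" "Claire_grapes"

# By default separate() splits on any run of non-alphanumeric characters,
# so "Jean-Luc_pear" would also be cut at the hyphen. Naming the separator
# keeps the hyphenated name intact.
restored <- united %>%
  separate(name_foods, into = c("names", "fave_fruit"), sep = "_")
restored$names      # "Jean-Luc" "Claire"
```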

Stringr

The stringr package has functions for processing text. Many of these take advantage of regular expressions (regex) which can be used to match complex patterns in a string variable.

#str_detect can be used to filter for a pattern. It returns a
#logical value (TRUE/FALSE)
Gap_minder%>%
  filter(year==1992,
         stringr::str_detect(country,"Al"))
## # A tibble: 2 x 6
##   country continent  year lifeExp      pop gdpPercap
##   <chr>   <chr>     <dbl>   <dbl>    <dbl>     <dbl>
## 1 Albania Europe     1992    71.6  3326498     2497.
## 2 Algeria Africa     1992    67.7 26298373     5023.
#str_extract returns matches from a string
Gap_minder%>%
  mutate(short_cont=
           stringr::str_extract(continent,"[a-zA-Z]{3}")%>%
           toupper())
## # A tibble: 1,704 x 7
##    country     continent  year lifeExp      pop gdpPercap short_cont
##    <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl> <chr>     
##  1 Afghanistan Asia       1952    28.8  8425333      779. ASI       
##  2 Afghanistan Asia       1957    30.3  9240934      821. ASI       
##  3 Afghanistan Asia       1962    32.0 10267083      853. ASI       
##  4 Afghanistan Asia       1967    34.0 11537966      836. ASI       
##  5 Afghanistan Asia       1972    36.1 13079460      740. ASI       
##  6 Afghanistan Asia       1977    38.4 14880372      786. ASI       
##  7 Afghanistan Asia       1982    39.9 12881816      978. ASI       
##  8 Afghanistan Asia       1987    40.8 13867957      852. ASI       
##  9 Afghanistan Asia       1992    41.7 16317921      649. ASI       
## 10 Afghanistan Asia       1997    41.8 22227415      635. ASI       
## # … with 1,694 more rows
#str_replace_all can be used to swap one pattern for another
Gap_minder%>%
  mutate(new_country=
           stringr::str_replace_all(country,"[aeiou]","-"))
## # A tibble: 1,704 x 7
##    country     continent  year lifeExp      pop gdpPercap new_country
##    <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl> <chr>      
##  1 Afghanistan Asia       1952    28.8  8425333      779. Afgh-n-st-n
##  2 Afghanistan Asia       1957    30.3  9240934      821. Afgh-n-st-n
##  3 Afghanistan Asia       1962    32.0 10267083      853. Afgh-n-st-n
##  4 Afghanistan Asia       1967    34.0 11537966      836. Afgh-n-st-n
##  5 Afghanistan Asia       1972    36.1 13079460      740. Afgh-n-st-n
##  6 Afghanistan Asia       1977    38.4 14880372      786. Afgh-n-st-n
##  7 Afghanistan Asia       1982    39.9 12881816      978. Afgh-n-st-n
##  8 Afghanistan Asia       1987    40.8 13867957      852. Afgh-n-st-n
##  9 Afghanistan Asia       1992    41.7 16317921      649. Afgh-n-st-n
## 10 Afghanistan Asia       1997    41.8 22227415      635. Afgh-n-st-n
## # … with 1,694 more rows
#str_detect can also be used to conditionally mutate 
labelled<-Gap_minder%>%
  filter(year==1992 & continent=="Europe")%>%
  mutate(country_label=if_else(
           stringr::str_detect(country,"Albania|France"),
           country,""))

ggplot(data=labelled,aes(x=lifeExp,y=gdpPercap,fill=continent))+
  geom_point(colour="blue")+
  ggtitle("European Countries Per Capita GDP 1992")+
  geom_text_repel(aes(x=lifeExp,y=gdpPercap,label=country_label))+
  xlim(c(65,80))+
  ylim(c(2000,35000))
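
The same stringr functions can be tried on a plain character vector, which makes the effect of each pattern easier to see. This is a sketch; the vector countries is made up for the example.

```r
library(stringr)

countries <- c("Albania", "Algeria", "France", "Portugal")

# Patterns are case sensitive by default
str_detect(countries, "Al")   # TRUE TRUE FALSE FALSE
str_detect(countries, "al")   # FALSE FALSE FALSE TRUE

# regex() with ignore_case = TRUE matches either case
str_detect(countries, regex("al", ignore_case = TRUE))   # TRUE TRUE FALSE TRUE

# ^ anchors the pattern to the start of the string
str_detect(countries, "^Al")  # TRUE TRUE FALSE FALSE

# str_extract returns the first match: here the first three letters
short <- toupper(str_extract(c("Asia", "Europe"), "[a-zA-Z]{3}"))
short                         # "ASI" "EUR"

# str_replace changes only the first match, str_replace_all changes every match
first_only <- str_replace("Afghanistan", "[aeiou]", "-")
every_one  <- str_replace_all("Afghanistan", "[aeiou]", "-")
first_only                    # "Afgh-nistan"
every_one                     # "Afgh-n-st-n"
```

The capital A in Afghanistan is left alone because the pattern [aeiou] lists lowercase vowels only.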

Forcats

Factors are text-based data that typically involve multiple observations of the same text. R stores factors as integers, and this sometimes complicates figures and data analysis. Forcats provides functions to handle factors.

#era is a character column here; ggplot treats it like a factor and sorts
#the levels alphabetically, so 00s comes before 50s
ggplot(Decades,aes(x=era,y=lifeExp))+
  ggtitle("Life Expectancy by Decade")+
  geom_boxplot()+
  ggplot2::geom_jitter(aes(colour=continent),alpha=0.3,
              show.legend = FALSE)

#We can reorder the variable using fct_reorder!
Decades<-Decades%>%
  mutate(era=fct_reorder(era,year))

#You can also do the same manually with fct_relevel

# Decades<-Decades%>%
#   mutate(era=fct_relevel(era,"00s",after=Inf))

ggplot(Decades,aes(x=era,y=lifeExp))+
  ggtitle("Life Expectancy by Decade (Corrected)")+
  geom_boxplot()+
  ggplot2::geom_jitter(aes(colour=continent),alpha=0.3,
              show.legend = FALSE)
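
What the reordering does is easiest to see from the factor levels themselves. A minimal sketch on two short vectors (era and year here are made-up stand-ins for the columns in Decades):

```r
library(forcats)

era  <- c("50s", "90s", "00s", "50s")
year <- c(1952, 1997, 2002, 1957)

# factor() sorts the levels alphabetically, and each value is stored
# as the integer position of its level
alphabetical <- factor(era)
levels(alphabetical)       # "00s" "50s" "90s"
as.integer(alphabetical)   # 2 3 1 2

# fct_reorder() orders the levels by a second variable
# (by default, the median of year within each level)
by_year <- fct_reorder(alphabetical, year)
levels(by_year)            # "50s" "90s" "00s"

# fct_relevel() moves named levels by hand; after = Inf sends them to the end
manual <- fct_relevel(alphabetical, "00s", after = Inf)
levels(manual)             # "50s" "90s" "00s"
```

Only the level order changes; the values in the column stay the same, which is why the corrected plot shows the same boxes in a different order.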

Helpful Info

Packages

You can find a package to solve almost any problem, but here are a few that I commonly use in my workflow:

Science Related

  • DESeq2 - Differential expression analysis (RNAseq)
  • pcr - Analyze qPCR data
  • seqinr - Work with biological data (fasta, fastq)
  • Rsamtools - Work with BAM and SAM files
  • GenomicRanges - Helpful for loading some biological data (GFF files)

Figure Making

  • ggforce - Adds more functionality to ggplot
  • cowplot - Makes ggplot output look nicer
  • colorspace - Manage colors
  • RColorBrewer - Incorporate better color palettes

Data input/processing

  • data.table - Load and manipulate large data files much faster than tidyverse
  • broom - Clean up output from various R functions (t.test, linear model)
  • reticulate - Integrate Python code with R code!
  • xlsx - Load data from Excel files
  • readxl - Load data from Excel files